{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The blood transfusion dataset\n",
    "\n",
    "In this notebook, we will present the \"blood transfusion\" dataset. This\n",
    "dataset is locally available in the directory `datasets` and it is stored as a\n",
    "comma separated value (CSV) file. We start by loading the entire dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "blood_transfusion = pd.read_csv(\"../datasets/blood_transfusion.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can have a first look at the at the dataset loaded."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "blood_transfusion.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this dataframe, we can see that the last column correspond to the target to\n",
    "be predicted called `\"Class\"`. We will create two variables, `data` and\n",
    "`target` to separate the data from which we could learn a predictive model and\n",
    "the `target` that should be predicted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = blood_transfusion.drop(columns=\"Class\")\n",
    "target = blood_transfusion[\"Class\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a first look at the `data` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe four columns. Each record corresponds to a person that intended to\n",
    "give blood. The information stored in each column are:\n",
    "\n",
    "* `Recency`: the time in months since the last time a person intended to give\n",
    "  blood;\n",
    "* `Frequency`: the number of time a person intended to give blood in the past;\n",
    "* `Monetary`: the amount of blood given in the past (in cm\u00b3);\n",
    "* `Time`: the time in months since the first time a person intended to give\n",
    "  blood.\n",
    "\n",
    "Now, let's have a look regarding the type of data that we are dealing in these\n",
    "columns and if any missing values are present in our dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our dataset is made of 748 samples. All features are represented with integer\n",
    "numbers and there is no missing values. We can have a look at each feature\n",
    "distributions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = data.hist(figsize=(12, 10), bins=30, edgecolor=\"black\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is nothing shocking regarding the distributions. We only observe a high\n",
    "value range for the features `\"Recency\"`, `\"Frequency\"`, and `\"Monetary\"`. It\n",
    "means that we have a few extreme high values for these features.\n",
    "\n",
    "Now, let's have a look at the target that we would like to predict for this\n",
    "task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "target.value_counts(normalize=True).plot.barh()\n",
    "plt.xlabel(\"Number of samples\")\n",
    "_ = plt.title(\"Class distribution\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the target is discrete and contains two categories: whether a\n",
    "person `\"donated\"` or `\"not donated\"` his/her blood. Thus the task to be\n",
    "solved is a classification problem. We should note that the class counts of\n",
    "these two classes is different."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target.value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, ~76% of the samples belong to the class `\"not donated\"`. It is rather\n",
    "important: a classifier that would predict always this `\"not donated\"` class\n",
    "would achieve an accuracy of 76% of good classification without using any\n",
    "information from the data itself. This issue is known as class imbalance. One\n",
    "should take care about the generalization performance metric used to evaluate\n",
    "a model as well as the predictive model chosen itself.\n",
    "\n",
    "Now, let's have a naive analysis to see if there is a link between features\n",
    "and the target using a pair plot representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "_ = sns.pairplot(blood_transfusion, hue=\"Class\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the diagonal plots, we don't see any feature that individually\n",
    "could help at separating the two classes. When looking at a pair of feature,\n",
    "we don't see any striking combinations as well. However, we can note that the\n",
    "`\"Monetary\"` and `\"Frequency\"` features are perfectly correlated: all the data\n",
    "points are aligned on a diagonal.\n",
    "\n",
    "As a conclusion, this dataset would be a challenging dataset: it suffer from\n",
    "class imbalance, correlated features and thus very few features will be\n",
    "available to learn a model, and none of the feature combinations were found to\n",
    "help at predicting."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}